DOI: 10.4230/OASIcs.LDK.2019.7

CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation

Authors: Christian Chiarcos and Niko Schenk

Published in: OASIcs, Volume 70, 2nd Conference on Language, Data and Knowledge (LDK 2019)

Abstract

The proper detection of tokens in running text represents the initial processing step in modular NLP pipelines. But strategies for defining these minimal units can differ, and conflicting analyses of the same text seriously limit the integration of subsequent linguistic annotations into a shared representation. As a solution, we introduce CoNLL Merge, a practical tool for harmonizing TSV-related data models, as they occur, e.g., in multi-layer corpora with non-sequential, concurrent tokenizations, but also in ensemble combinations in Natural Language Processing. CoNLL Merge works unsupervised, requires no manual intervention or external data sources, and comes with a flexible API for fully automated merging routines, validity and sanity checks. Users can choose from several merging strategies, and either preserve a reference tokenization (with possible losses of annotation granularity), create a common tokenization layer consisting of minimal shared subtokens (lossless in terms of annotation granularity, destructive against a reference tokenization), or present tokenization clashes (lossless and non-destructive, but introducing empty tokens as placeholders for unaligned elements). We demonstrate the applicability of the tool on two use cases from natural language processing and computational philology.
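
To make the second strategy (a common layer of minimal shared subtokens) concrete, here is a minimal Python sketch. It is not the tool's implementation: the function names `boundaries` and `minimal_subtokens` are hypothetical, and the sketch assumes both tokenizations cover exactly the same character sequence, so it does not address textual variation.

```python
from itertools import accumulate


def boundaries(tokens):
    """Return the set of character offsets at which tokens end."""
    return set(accumulate(len(t) for t in tokens))


def minimal_subtokens(tokens_a, tokens_b):
    """Split the shared text at every boundary found in either tokenization.

    Assumption (not from the paper): both token lists concatenate to the
    same string; otherwise a ValueError is raised.
    """
    text = "".join(tokens_a)
    if text != "".join(tokens_b):
        raise ValueError("tokenizations cover different text")
    # Union of all end offsets; offset 0 would only yield an empty token.
    cuts = sorted((boundaries(tokens_a) | boundaries(tokens_b)) - {0})
    starts = [0] + cuts[:-1]
    return [text[s:e] for s, e in zip(starts, cuts)]


if __name__ == "__main__":
    # One tokenizer keeps the contraction whole, the other splits it.
    print(minimal_subtokens(["don't", "stop"], ["do", "n't", "stop"]))
    # A genuine clash: neither tokenization refines the other.
    print(minimal_subtokens(["ab", "c"], ["a", "bc"]))
```

Because every boundary from either input survives, annotations from both layers can be attached to the resulting subtokens without loss of granularity, but neither original tokenization is preserved as such.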

Cite as

Christian Chiarcos and Niko Schenk. CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation. In 2nd Conference on Language, Data and Knowledge (LDK 2019). Open Access Series in Informatics (OASIcs), Volume 70, pp. 7:1-7:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{chiarcos_et_al:OASIcs.LDK.2019.7,
  author =	{Chiarcos, Christian and Schenk, Niko},
  title =	{{CoNLL-Merge: Efficient Harmonization of Concurrent Tokenization and Textual Variation}},
  booktitle =	{2nd Conference on Language, Data and Knowledge (LDK 2019)},
  pages =	{7:1--7:14},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-105-4},
  ISSN =	{2190-6807},
  year =	{2019},
  volume =	{70},
  editor =	{Eskevich, Maria and de Melo, Gerard and F{\"a}th, Christian and McCrae, John P. and Buitelaar, Paul and Chiarcos, Christian and Klimek, Bettina and Dojchinovski, Milan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/OASIcs.LDK.2019.7},
  URN =		{urn:nbn:de:0030-drops-103717},
  doi =		{10.4230/OASIcs.LDK.2019.7},
  annote =	{Keywords: data heterogeneity, tokenization, tab-separated values (TSV) format, linguistic annotation, merging}
}
